Personal Loan Campaign Project 4
Brandy Murray
AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.
Objective
Data Dictionary
Import Libraries
# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Import warnings
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualizations
import matplotlib.pyplot as plt
import seaborn as sns
# Remove the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
precision_recall_curve,
roc_curve,
)
# To build the Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
The nb_black extension is already loaded. To reload it, use: %reload_ext nb_black
Load and Explore Data
# Load the data into pandas dataframe
loan = pd.read_csv("Loan_Modelling_with_Coordinates.csv", dtype={"ZIPCode": object})
# Because ZIP codes carry no numeric meaning, I am importing ZIPCode as an object.
# Copy the data
df = loan.copy()
# Printing the shape of the data in an f-string
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")
There are 5000 rows and 16 columns.
# This shows the first 5 rows of data
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Longitude | Latitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | -118.086178 | 34.155533 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | -118.286035 | 34.020221 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | -122.254944 | 37.873832 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | -122.442950 | 37.720375 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | -118.526076 | 34.247532 |
# I'm now going to look at 10 random rows
# I'm setting the random seed via np.random.seed so that
# I get the same random results every time
np.random.seed(1)
df.sample(n=10)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Longitude | Latitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2764 | 2765 | 31 | 5 | 84 | 91320 | 1 | 2.9 | 3 | 105 | 0 | 0 | 0 | 0 | 1 | -118.945127 | 34.175969 |
| 4767 | 4768 | 35 | 9 | 45 | 90639 | 3 | 0.9 | 1 | 101 | 0 | 1 | 0 | 0 | 0 | -118.015360 | 33.906790 |
| 3814 | 3815 | 34 | 9 | 35 | 94304 | 3 | 1.3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | -122.181698 | 37.374707 |
| 3499 | 3500 | 49 | 23 | 114 | 94550 | 1 | 0.3 | 1 | 286 | 0 | 0 | 0 | 1 | 0 | -121.575596 | 37.519986 |
| 2735 | 2736 | 36 | 12 | 70 | 92131 | 3 | 2.6 | 2 | 165 | 0 | 0 | 0 | 1 | 0 | -117.085982 | 32.886070 |
| 3922 | 3923 | 31 | 4 | 20 | 95616 | 4 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | -121.798467 | 38.554133 |
| 2701 | 2702 | 50 | 26 | 55 | 94305 | 1 | 1.6 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | -122.170871 | 37.418256 |
| 1179 | 1180 | 36 | 11 | 98 | 90291 | 3 | 1.2 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | -118.465193 | 33.993396 |
| 932 | 933 | 51 | 27 | 112 | 94720 | 3 | 1.8 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | -122.254944 | 37.873832 |
| 792 | 793 | 41 | 16 | 98 | 93117 | 1 | 4.0 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | -120.084094 | 34.479453 |
# This shows the last 5 rows of the data
df.tail()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Longitude | Latitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | -117.840940 | 33.647320 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 | -117.250058 | 32.856347 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | -119.310133 | 34.530199 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | -118.399613 | 34.030578 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | -117.826009 | 33.658440 |
Looking at these three slices of the data, we can see that it is safe to drop the ID column and work with the pandas index instead, since the two differ only by one (Python indexing starts at 0).
We can also see several columns, such as Age, Experience, Income, and ZIPCode, whose many values we will want to combine into groups to reduce the number of distinct levels. I also added the Longitude and Latitude coordinates to see whether we can gain any perspective from the locations of these cities.
# Dropping 'ID' as stated above
df.drop("ID", axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 5000 non-null int64 1 Experience 5000 non-null int64 2 Income 5000 non-null int64 3 ZIPCode 5000 non-null object 4 Family 5000 non-null int64 5 CCAvg 5000 non-null float64 6 Education 5000 non-null int64 7 Mortgage 5000 non-null int64 8 Personal_Loan 5000 non-null int64 9 Securities_Account 5000 non-null int64 10 CD_Account 5000 non-null int64 11 Online 5000 non-null int64 12 CreditCard 5000 non-null int64 13 Longitude 5000 non-null float64 14 Latitude 5000 non-null float64 dtypes: float64(3), int64(11), object(1) memory usage: 586.1+ KB
Here we can see that all the column names are in good shape, with no '-' or '.' characters. I initially imported ZIPCode as an object since ZIP codes carry no numeric meaning. Looking at Family and Education, I believe these can be converted to categories as well.
df.ZIPCode = df.ZIPCode.astype("category")
df.Family = df.Family.astype("category")
df.Education = df.Education.astype("category")
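Beyond marking these columns as nominal, the `category` dtype can also cut memory use substantially when a column has few distinct values. A minimal sketch with synthetic data (the labels below are illustrative, not the real ZIPCode distribution):

```python
import numpy as np
import pandas as pd

# A small synthetic column with many repeated labels, standing in for ZIPCode
np.random.seed(1)
zips = pd.Series(np.random.choice(["94720", "94305", "95616", "90095"], size=5000))

# deep=True counts the actual string objects, not just the pointers
as_object = zips.memory_usage(deep=True)
as_category = zips.astype("category").memory_usage(deep=True)

print(f"object: {as_object} bytes, category: {as_category} bytes")
```

With only a handful of distinct labels, the category version stores one small integer code per row plus the label list once, so it is far smaller.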
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 5000.0 | 45.338400 | 11.463166 | 23.000000 | 35.000000 | 45.00000 | 55.000000 | 67.000000 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.000000 | 10.000000 | 20.00000 | 30.000000 | 43.000000 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.000000 | 39.000000 | 64.00000 | 98.000000 | 224.000000 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.000000 | 0.700000 | 1.50000 | 2.500000 | 10.000000 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.000000 | 0.000000 | 0.00000 | 101.000000 | 635.000000 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 1.000000 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 1.000000 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 1.000000 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.000000 | 0.000000 | 1.00000 | 1.000000 | 1.000000 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 1.000000 |
| Longitude | 5000.0 | -120.026564 | 2.090065 | -124.107619 | -122.141700 | -119.86740 | -118.075043 | -115.675459 |
| Latitude | 5000.0 | 35.783215 | 2.099409 | 32.553021 | 34.002813 | 35.30168 | 37.750502 | 41.754853 |
Here we can see that there are no missing values with all the columns having a count of 5000. A further investigation will be needed to make sure each of those values is valid though.
Age: The average age in the dataset is 45; ages range from 23 to 67 years.
Experience: The average experience is 20 years. The -3 showing in the min column will need further investigation.
Income: The median income is \$64,000 and the mean is about \$74,000, but the range in this column is very large, from a min of \$8,000 to a max of \$224,000.
df.nunique()
Age 45 Experience 47 Income 162 ZIPCode 467 Family 4 CCAvg 108 Education 3 Mortgage 347 Personal_Loan 2 Securities_Account 2 CD_Account 2 Online 2 CreditCard 2 Longitude 454 Latitude 453 dtype: int64
for i in df:
print("Unique values in", i, "are :")
print(df[i].value_counts())
print("*" * 50)
Unique values in Age are :
35 151
43 149
52 145
58 143
54 143
50 138
41 136
30 136
56 135
34 134
39 133
59 132
57 132
51 129
60 127
45 127
46 127
42 126
40 125
31 125
55 125
62 123
29 123
61 122
44 121
32 120
33 120
48 118
38 115
49 115
47 113
53 112
63 108
36 107
37 106
28 103
27 91
65 80
64 78
26 78
25 53
24 28
66 24
23 12
67 12
Name: Age, dtype: int64
**************************************************
Unique values in Experience are :
32 154
20 148
9 147
5 146
23 144
35 143
25 142
28 138
18 137
19 135
26 134
24 131
3 129
14 127
16 127
30 126
34 125
27 125
17 125
29 124
22 124
7 121
8 119
6 119
15 119
10 118
33 117
13 117
11 116
37 116
36 114
21 113
4 113
31 104
12 102
38 88
39 85
2 85
1 74
0 66
40 57
41 43
-1 33
-2 15
42 8
-3 4
43 3
Name: Experience, dtype: int64
**************************************************
Unique values in Income are :
44 85
38 84
81 83
41 82
39 81
40 78
42 77
83 74
43 70
45 69
29 67
21 65
35 65
22 65
85 65
25 64
84 63
28 63
30 63
55 61
82 61
78 61
65 60
64 60
32 58
61 57
53 57
80 56
58 55
62 55
31 55
23 54
34 53
18 53
59 53
79 53
54 52
19 52
49 52
60 52
33 51
70 47
52 47
20 47
24 47
75 47
69 46
63 46
50 45
74 45
48 44
73 44
71 43
51 41
72 41
90 38
91 37
93 37
68 35
113 34
89 34
15 33
13 32
14 31
12 30
114 30
92 29
98 28
115 27
11 27
94 26
9 26
112 26
88 26
95 25
141 24
101 24
99 24
128 24
122 24
125 23
129 23
145 23
8 23
10 23
111 22
154 21
134 20
104 20
149 20
105 20
121 20
140 19
130 19
131 19
118 19
110 19
155 19
119 18
123 18
138 18
135 18
180 18
103 18
158 18
132 18
109 18
120 17
179 17
102 16
108 16
139 16
161 16
195 15
152 15
133 15
142 15
191 13
173 13
182 13
164 13
184 12
170 12
124 12
160 12
183 12
175 12
190 11
172 11
150 11
165 11
148 11
153 11
100 10
162 10
188 10
178 10
163 9
143 9
185 9
174 9
171 9
181 8
194 8
168 8
144 7
169 7
159 7
193 6
192 6
201 5
151 4
200 3
198 3
204 3
199 3
203 2
189 2
202 2
205 2
224 1
218 1
Name: Income, dtype: int64
**************************************************
Unique values in ZIPCode are :
94720 169
94305 127
95616 116
90095 71
93106 57
...
94598 1
94965 1
94970 1
90068 1
90813 1
Name: ZIPCode, Length: 467, dtype: int64
**************************************************
Unique values in Family are :
1 1472
2 1296
4 1222
3 1010
Name: Family, dtype: int64
**************************************************
Unique values in CCAvg are :
0.30 241
1.00 231
0.20 204
2.00 188
0.80 187
0.10 183
0.40 179
1.50 178
0.70 169
0.50 163
1.70 158
1.80 152
1.40 136
2.20 130
1.30 128
0.60 118
2.80 110
2.50 107
0.90 106
0.00 106
1.90 106
1.60 101
2.10 100
2.40 92
2.60 87
1.10 84
1.20 66
2.70 58
2.30 58
2.90 54
3.00 53
3.30 45
3.80 43
3.40 39
2.67 36
4.00 33
4.50 29
3.90 27
3.60 27
4.30 26
6.00 26
3.70 25
4.70 24
3.20 22
4.10 22
4.90 22
3.10 20
6.50 18
5.00 18
5.40 18
0.67 18
2.33 18
1.67 18
4.40 17
5.20 16
3.50 15
6.90 14
7.00 14
6.10 14
4.60 14
7.20 13
5.70 13
7.40 13
6.30 13
7.50 12
8.00 12
4.20 11
6.33 10
6.80 10
8.10 10
7.30 10
0.75 9
1.75 9
6.67 9
4.33 9
7.60 9
6.70 9
1.33 9
8.80 9
7.80 9
8.60 8
4.80 7
5.60 7
5.10 6
5.90 5
7.90 4
5.30 4
6.60 4
5.50 4
5.80 3
10.00 3
6.40 3
4.75 2
8.50 2
4.25 2
8.30 2
5.67 2
6.20 2
9.00 2
3.33 1
8.90 1
4.67 1
3.25 1
2.75 1
8.20 1
9.30 1
3.67 1
5.33 1
Name: CCAvg, dtype: int64
**************************************************
Unique values in Education are :
1 2096
3 1501
2 1403
Name: Education, dtype: int64
**************************************************
Unique values in Mortgage are :
0 3462
98 17
103 16
119 16
83 16
...
541 1
509 1
505 1
485 1
577 1
Name: Mortgage, Length: 347, dtype: int64
**************************************************
Unique values in Personal_Loan are :
0 4520
1 480
Name: Personal_Loan, dtype: int64
**************************************************
Unique values in Securities_Account are :
0 4478
1 522
Name: Securities_Account, dtype: int64
**************************************************
Unique values in CD_Account are :
0 4698
1 302
Name: CD_Account, dtype: int64
**************************************************
Unique values in Online are :
1 2984
0 2016
Name: Online, dtype: int64
**************************************************
Unique values in CreditCard are :
0 3530
1 1470
Name: CreditCard, dtype: int64
**************************************************
Unique values in Longitude are :
-122.254944 169
-122.170871 127
-121.798467 116
-118.443523 71
-119.847480 57
...
-122.646890 1
-122.036959 1
-122.547207 1
-120.170923 1
-118.330989 1
Name: Longitude, Length: 454, dtype: int64
**************************************************
Unique values in Latitude are :
37.873832 169
37.418256 127
38.554133 116
34.071200 71
34.414420 57
...
37.852904 1
37.351529 1
34.168771 1
33.547238 1
34.129772 1
Name: Latitude, Length: 453, dtype: int64
**************************************************
Here we can see that Experience has 4 counts of -3, 15 counts of -2, and 33 counts of -1. These are clearly errors, since you cannot have a negative amount of experience. I briefly considered whether these values could somehow refer to someone still being in high school, but looking at the ages, no one is young enough. Therefore, I am going to assume the minus sign is the error and remove it, so these counts will be added to the totals for 1, 2, and 3.
We can also see that under Mortgage, 3,462 of the 5,000 values are 0, which I take to mean those customers do not have a mortgage, based on the wording in the description. We will have to see whether this 69% share relates to whether or not these people purchase a loan.
# Verifying the locations of the -3 values so I can make sure the fix below works.
df[df["Experience"] == -3].index
Int64Index([2618, 3626, 4285, 4514], dtype='int64')
# Taking the absolute value flips the sign on the negative Experience entries
df.Experience = df.Experience.abs()
df.loc[2618, "Experience"]
3
As far as I can tell, the data in the columns look good. Before starting the analysis portion, though, I want to create some bins to help manage the columns that contain many distinct values.
# Creating the dictionary to start binning the Age Column into 5 different sections.
Age_Dict = {
"Age_20to29": list(range(20, 30)),
"Age_30to39": list(range(30, 40)),
"Age_40to49": list(range(40, 50)),
"Age_50to59": list(range(50, 60)),
"Age_60to69": list(range(60, 70)),
}
Age_Dict
{'Age_20to29': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
'Age_30to39': [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
'Age_40to49': [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
'Age_50to59': [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
'Age_60to69': [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]}
# Applying the 5000 ages to the 5 sections
def age_combining(x):
if x in list(range(20, 30)):
return "Age_20to29"
elif x in list(range(30, 40)):
return "Age_30to39"
elif x in list(range(40, 50)):
return "Age_40to49"
elif x in list(range(50, 60)):
return "Age_50to59"
elif x in list(range(60, 70)):
return "Age_60to69"
else:
return x
# Creating a new column
df["Age_Bins"] = df["Age"].map(age_combining)
# From here I feel like it would be okay to drop age but I am going to wait until I am really sure
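The same decade grouping can be produced more compactly with `pd.cut`; a sketch on a few illustrative ages (the same idea would apply to the Experience and Income bins below):

```python
import pandas as pd

ages = pd.Series([25, 45, 39, 35, 67])

# Bin edges match the decade groups above; right=False makes each bin
# inclusive on the left, so e.g. 30 falls in Age_30to39
age_bins = pd.cut(
    ages,
    bins=[20, 30, 40, 50, 60, 70],
    right=False,
    labels=["Age_20to29", "Age_30to39", "Age_40to49", "Age_50to59", "Age_60to69"],
)
print(age_bins.tolist())
```

This also returns an ordered categorical column directly, avoiding the need for a separate mapping function.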
# Creating the dictionary to start binning the Experience Column into 4 different sections.
Exp_Dict = {
"Exp_0to10": list(range(0, 11)),
"Exp_11to20": list(range(11, 21)),
"Exp_21to30": list(range(21, 31)),
"Exp_30toPlus": list(range(31, 50)),
}
Exp_Dict
{'Exp_0to10': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Exp_11to20': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
'Exp_21to30': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
'Exp_30toPlus': [31,
32,
33,
34,
35,
36,
37,
38,
39,
40,
41,
42,
43,
44,
45,
46,
47,
48,
49]}
# Applying the 5000 data points for Experience to the 4 sections
def exp_combining(x):
if x in list(range(0, 11)):
return "Exp_0to10"
elif x in list(range(11, 21)):
return "Exp_11to20"
elif x in list(range(21, 31)):
return "Exp_21to30"
elif x in list(range(31, 50)):
return "Exp_30toPlus"
else:
return x
# Creating a new column
df["Exp_Bins"] = df["Experience"].map(exp_combining)
# From here I feel like it would be okay to drop Experience but I am going to wait until I am really sure
# Creating the dictionary to start binning the Income Column into 5 different sections.
Inc_Dict = {
"Inc_0to25K": list(range(0, 26)),
"Inc_26to50K": list(range(26, 51)),
"Inc_51to75K": list(range(51, 76)),
"Inc_76to100K": list(range(76, 101)),
"Inc_100K_Plus": list(range(101, 250)),
}
Inc_Dict
{'Inc_0to25K': [0,
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25],
'Inc_26to50K': [26,
27,
28,
29,
30,
31,
32,
33,
34,
35,
36,
37,
38,
39,
40,
41,
42,
43,
44,
45,
46,
47,
48,
49,
50],
'Inc_51to75K': [51,
52,
53,
54,
55,
56,
57,
58,
59,
60,
61,
62,
63,
64,
65,
66,
67,
68,
69,
70,
71,
72,
73,
74,
75],
'Inc_76to100K': [76,
77,
78,
79,
80,
81,
82,
83,
84,
85,
86,
87,
88,
89,
90,
91,
92,
93,
94,
95,
96,
97,
98,
99,
100],
'Inc_100K_Plus': [101,
102,
103,
104,
105,
106,
107,
108,
109,
110,
111,
112,
113,
114,
115,
116,
117,
118,
119,
120,
121,
122,
123,
124,
125,
126,
127,
128,
129,
130,
131,
132,
133,
134,
135,
136,
137,
138,
139,
140,
141,
142,
143,
144,
145,
146,
147,
148,
149,
150,
151,
152,
153,
154,
155,
156,
157,
158,
159,
160,
161,
162,
163,
164,
165,
166,
167,
168,
169,
170,
171,
172,
173,
174,
175,
176,
177,
178,
179,
180,
181,
182,
183,
184,
185,
186,
187,
188,
189,
190,
191,
192,
193,
194,
195,
196,
197,
198,
199,
200,
201,
202,
203,
204,
205,
206,
207,
208,
209,
210,
211,
212,
213,
214,
215,
216,
217,
218,
219,
220,
221,
222,
223,
224,
225,
226,
227,
228,
229,
230,
231,
232,
233,
234,
235,
236,
237,
238,
239,
240,
241,
242,
243,
244,
245,
246,
247,
248,
249]}
# Applying the 5000 data points for Income to the 5 sections
def inc_combining(x):
if x in list(range(0, 26)):
return "Inc_0to25K"
elif x in list(range(26, 51)):
return "Inc_26to50K"
elif x in list(range(51, 76)):
return "Inc_51to75K"
elif x in list(range(76, 101)):
return "Inc_76to100K"
elif x in list(range(101, 250)):
return "Inc_100K_Plus"
else:
return x
# Creating a new column
df["Inc_Bins"] = df["Income"].map(inc_combining)
# From here I feel like it would be okay to drop Income but I am going to wait until I am really sure
# Here I want to see what cities are associated with each ZIP code
from uszipcode import SearchEngine
search = SearchEngine(simple_zipcode=False)
city = []
for i in np.arange(0, len(df["ZIPCode"])):
zipcode = search.by_zipcode(df["ZIPCode"][i])
city.append(zipcode.major_city)
# Creating a new column with the City names
df["City"] = city
df.head()
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Longitude | Latitude | Age_Bins | Exp_Bins | Inc_Bins | City | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | -118.086178 | 34.155533 | Age_20to29 | Exp_0to10 | Inc_26to50K | Pasadena |
| 1 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | -118.286035 | 34.020221 | Age_40to49 | Exp_11to20 | Inc_26to50K | Los Angeles |
| 2 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | -122.254944 | 37.873832 | Age_30to39 | Exp_11to20 | Inc_0to25K | Berkeley |
| 3 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | -122.442950 | 37.720375 | Age_30to39 | Exp_0to10 | Inc_76to100K | San Francisco |
| 4 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | -118.526076 | 34.247532 | Age_30to39 | Exp_0to10 | Inc_26to50K | Northridge |
# Here I want to see what counties are associated with each ZIP code
county = []
for i in np.arange(0, len(df["ZIPCode"])):
zipcode = search.by_zipcode(df["ZIPCode"][i])
county.append(zipcode.county)
# Creating a new column with the County names
df["County"] = county
df.head()
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Longitude | Latitude | Age_Bins | Exp_Bins | Inc_Bins | City | County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | -118.086178 | 34.155533 | Age_20to29 | Exp_0to10 | Inc_26to50K | Pasadena | Los Angeles County |
| 1 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | -118.286035 | 34.020221 | Age_40to49 | Exp_11to20 | Inc_26to50K | Los Angeles | Los Angeles County |
| 2 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | -122.254944 | 37.873832 | Age_30to39 | Exp_11to20 | Inc_0to25K | Berkeley | Alameda County |
| 3 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | -122.442950 | 37.720375 | Age_30to39 | Exp_0to10 | Inc_76to100K | San Francisco | San Francisco County |
| 4 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | -118.526076 | 34.247532 | Age_30to39 | Exp_0to10 | Inc_26to50K | Northridge | Los Angeles County |
# Here I want to see what median household income is associated with each ZIP code
median_household_income = []
for i in np.arange(0, len(df["ZIPCode"])):
zipcode = search.by_zipcode(df["ZIPCode"][i])
median_household_income.append(zipcode.median_household_income)
# Creating a new column with the median_household_income amount
df["median_household_income"] = median_household_income
df.head()
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Longitude | Latitude | Age_Bins | Exp_Bins | Inc_Bins | City | County | median_household_income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | -118.086178 | 34.155533 | Age_20to29 | Exp_0to10 | Inc_26to50K | Pasadena | Los Angeles County | 80936.0 |
| 1 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | -118.286035 | 34.020221 | Age_40to49 | Exp_11to20 | Inc_26to50K | Los Angeles | Los Angeles County | 11750.0 |
| 2 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | -122.254944 | 37.873832 | Age_30to39 | Exp_11to20 | Inc_0to25K | Berkeley | Alameda County | 23304.0 |
| 3 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | -122.442950 | 37.720375 | Age_30to39 | Exp_0to10 | Inc_76to100K | San Francisco | San Francisco County | 71625.0 |
| 4 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | -118.526076 | 34.247532 | Age_30to39 | Exp_0to10 | Inc_26to50K | Northridge | Los Angeles County | NaN |
df.value_counts(["median_household_income"])
median_household_income
23304.0 169
64697.0 127
44741.0 116
99367.0 54
104665.0 53
...
67500.0 1
81743.0 1
88837.0 1
127906.0 1
114771.0 1
Length: 401, dtype: int64
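The three loops above call `search.by_zipcode` once per row, i.e. 15,000 lookups for the three columns, even though only 467 ZIP codes are distinct. Querying each unique ZIP once and mapping the cached result back is much cheaper. A sketch of the pattern, with a hypothetical stub standing in for the uszipcode call so it is self-contained:

```python
import pandas as pd

# Stub lookup table standing in for search.by_zipcode(...); in the notebook
# this would be one SearchEngine query per unique ZIP code
STUB = {
    "91107": ("Pasadena", "Los Angeles County"),
    "94720": ("Berkeley", "Alameda County"),
}

def lookup(zipcode):
    # Hypothetical helper: return (city, county) or (None, None) if unknown
    return STUB.get(zipcode, (None, None))

sample_df = pd.DataFrame({"ZIPCode": ["91107", "94720", "91107"]})

# One lookup per distinct ZIP, then map the cached results back to the rows
cache = {z: lookup(z) for z in sample_df["ZIPCode"].unique()}
sample_df["City"] = sample_df["ZIPCode"].map(lambda z: cache[z][0])
sample_df["County"] = sample_df["ZIPCode"].map(lambda z: cache[z][1])
print(sample_df)
```

The same cache dictionary could hold the median household income as a third tuple element, replacing all three row-wise loops with a single pass.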
Univariate Analysis
Summary Statistics of Numeric and Non-Numeric Variables
# This provides a quick summary of the numeric features
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 5000.0 | 45.338400 | 11.463166 | 23.000000 | 35.000000 | 45.00000 | 55.000000 | 67.000000 |
| Experience | 5000.0 | 20.134600 | 11.415189 | 0.000000 | 10.000000 | 20.00000 | 30.000000 | 43.000000 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.000000 | 39.000000 | 64.00000 | 98.000000 | 224.000000 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.000000 | 0.700000 | 1.50000 | 2.500000 | 10.000000 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.000000 | 0.000000 | 0.00000 | 101.000000 | 635.000000 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 1.000000 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 1.000000 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 1.000000 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.000000 | 0.000000 | 1.00000 | 1.000000 | 1.000000 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 1.000000 |
| Longitude | 5000.0 | -120.026564 | 2.090065 | -124.107619 | -122.141700 | -119.86740 | -118.075043 | -115.675459 |
| Latitude | 5000.0 | 35.783215 | 2.099409 | 32.553021 | 34.002813 | 35.30168 | 37.750502 | 41.754853 |
| median_household_income | 4205.0 | 74933.713436 | 31751.406294 | 11750.000000 | 54166.000000 | 71969.00000 | 94479.000000 | 250001.000000 |
# While doing univariate analysis of numerical variables we want to study their central tendency and dispersion.
# Let us write a function that will help us create boxplot and histogram for any input numerical variable.
# This function takes the numerical column as the input and returns the boxplots and histograms for the variable.
# Let us see if this helps us write faster and cleaner code.
def histogram_boxplot(feature, figsize=(7, 12), bins=None):
"""Boxplot and histogram combined
feature: 1-d feature array
figsize: size of fig (default (7, 12))
bins: number of bins (default None / auto)
"""
sns.set(font_scale=2) # setting the font scale for seaborn
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid=2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.distplot(
feature, kde=False, ax=ax_hist2, bins=bins
) if bins else sns.distplot(
feature, kde=False, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
feature.mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
feature.median(), color="black", linestyle="-"
) # Add median to the histogram
num_col = df.select_dtypes(include=np.number).columns.tolist()
for col in num_col:
histogram_boxplot(df[col])
Here we can see spikes at certain ages and experience levels; without these spikes, the data would be fairly flat across the top. Income is skewed to the right, which we saw before: more than 75% of incomes are below \$98,000, but the max reaches \$224,000. CCAvg is also right-skewed. Mortgage displays very oddly since there are 3,462 zeros. Personal_Loan, Securities_Account, CD_Account, Online, and CreditCard are all boolean values. I found the median household income really interesting: it appears almost normally distributed with a slight right skew.
# Let's plot histograms of all numerical variables
all_col = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(17, 75))
for i in range(len(all_col)):
plt.subplot(18, 3, i + 1)
# plt.hist(df[all_col[i]])
sns.histplot(
df[all_col[i]], kde=True
) # you can comment the previous line and run this one to get distribution curves
plt.tight_layout()
plt.title(all_col[i], fontsize=25)
plt.show()
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
for col in num_col:
labeled_barplot(df, col, perc=True)
labeled_barplot(df, "Age_Bins", perc=True)
labeled_barplot(df, "Exp_Bins", perc=True)
labeled_barplot(df, "Inc_Bins", perc=True)
labeled_barplot(df, "County", perc=True)
import folium
# Coordinates for base map of US
USA = [36.7783, -119.4179]
Map = folium.Map(USA, zoom_start=6, tiles="Stamen Terrain")
for i in range(0, len(df)):
folium.Marker(
[df.iloc[i]["Latitude"], df.iloc[i]["Longitude"]],
popup=[
df.iloc[i]["City"],
df.iloc[i]["County"],
df.iloc[i]["median_household_income"],
],
).add_to(Map)
Map
Log Transformation
I wanted to see what would happen if I took the log of Mortgage. Here we can see that none of the transformations changed the shape of the distribution much, largely because of the large spike of zeros. I thought this might happen, but I wanted to try it and see.
plt.hist(df["Mortgage"], 50)
plt.title("Mortgage")
plt.show()
plt.hist(np.log(df["Mortgage"] + 1), 50)
plt.title("log(Mortgage + 1)")
plt.show()
plt.hist(np.arcsinh(df["Mortgage"]), 50)
plt.title("arcsinh(Mortgage)")
plt.show()
plt.hist(np.sqrt(df["Mortgage"]), 50)
plt.title("sqrt(Mortgage)")
plt.show()
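One way to quantify "didn't change much" is to compare the skewness before and after each transform. A sketch on synthetic zero-inflated data (roughly mimicking Mortgage's 69% share of zeros; this is illustrative, not the real column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Zero-inflated stand-in for Mortgage: ~69% zeros, right-skewed remainder
mortgage = pd.Series(
    np.where(rng.random(5000) < 0.69, 0.0, rng.gamma(2.0, 80.0, 5000))
)

# np.log1p(x) is log(x + 1), matching the plot above
for name, transformed in [
    ("raw", mortgage),
    ("log1p", np.log1p(mortgage)),
    ("arcsinh", np.arcsinh(mortgage)),
    ("sqrt", np.sqrt(mortgage)),
]:
    print(f"{name}: skew = {transformed.skew():.2f}")
```

The transforms compress the right tail, but the point mass at zero stays exactly where it was, so the distribution remains bimodal rather than becoming normal.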
Bivariate Analysis
sns.pairplot(data=df)
<seaborn.axisgrid.PairGrid at 0x2863c23ef88>
df.corr()
| Age | Experience | Income | CCAvg | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Longitude | Latitude | median_household_income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.000000 | 0.993991 | -0.055269 | -0.052012 | -0.012539 | -0.007726 | -0.000436 | 0.008043 | 0.013702 | 0.007681 | 0.018560 | -0.024288 | 0.024439 |
| Experience | 0.993991 | 1.000000 | -0.046876 | -0.049738 | -0.011097 | -0.008304 | -0.000989 | 0.009735 | 0.014051 | 0.008851 | 0.019530 | -0.025024 | 0.024099 |
| Income | -0.055269 | -0.046876 | 1.000000 | 0.645984 | 0.206806 | 0.502462 | -0.002616 | 0.169738 | 0.014206 | -0.002385 | 0.016684 | -0.026597 | 0.039179 |
| CCAvg | -0.052012 | -0.049738 | 0.645984 | 1.000000 | 0.109905 | 0.366889 | 0.015086 | 0.136534 | -0.003611 | -0.006689 | 0.006470 | -0.015820 | 0.025563 |
| Mortgage | -0.012539 | -0.011097 | 0.206806 | 0.109905 | 1.000000 | 0.142095 | -0.005411 | 0.089311 | -0.005995 | -0.007231 | 0.002402 | -0.001526 | 0.003189 |
| Personal_Loan | -0.007726 | -0.008304 | 0.502462 | 0.366889 | 0.142095 | 1.000000 | 0.021954 | 0.316355 | 0.006278 | 0.002802 | -0.001711 | -0.006482 | -0.000318 |
| Securities_Account | -0.000436 | -0.000989 | -0.002616 | 0.015086 | -0.005411 | 0.021954 | 1.000000 | 0.317034 | 0.012627 | -0.015028 | -0.002653 | -0.002881 | 0.006124 |
| CD_Account | 0.008043 | 0.009735 | 0.169738 | 0.136534 | 0.089311 | 0.316355 | 0.317034 | 1.000000 | 0.175880 | 0.278644 | -0.031453 | 0.025593 | -0.013494 |
| Online | 0.013702 | 0.014051 | 0.014206 | -0.003611 | -0.005995 | 0.006278 | 0.012627 | 0.175880 | 1.000000 | 0.004210 | -0.019259 | 0.030391 | 0.008912 |
| CreditCard | 0.007681 | 0.008851 | -0.002385 | -0.006689 | -0.007231 | 0.002802 | -0.015028 | 0.278644 | 0.004210 | 1.000000 | -0.020718 | 0.015257 | 0.011088 |
| Longitude | 0.018560 | 0.019530 | 0.016684 | 0.006470 | 0.002402 | -0.001711 | -0.002653 | -0.031453 | -0.019259 | -0.020718 | 1.000000 | -0.948899 | -0.052726 |
| Latitude | -0.024288 | -0.025024 | -0.026597 | -0.015820 | -0.001526 | -0.006482 | -0.002881 | 0.025593 | 0.030391 | 0.015257 | -0.948899 | 1.000000 | -0.035227 |
| median_household_income | 0.024439 | 0.024099 | 0.039179 | 0.025563 | 0.003189 | -0.000318 | 0.006124 | -0.013494 | 0.008912 | 0.011088 | -0.052726 | -0.035227 | 1.000000 |
Here we can see that Age and Experience are highly correlated (0.99). At first glance Income also looks highly correlated with CCAvg and Mortgage, but the actual values are only 0.65 and 0.21 respectively.
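Rather than eyeballing the matrix, pairs above a chosen cutoff can also be pulled out programmatically. A minimal sketch on a toy frame (the `high_corr_pairs` helper and the `demo` data are illustrative, not part of the project data):

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.9):
    """Return column pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle to skip the diagonal and duplicate pairs
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().loc[lambda s: s > threshold].sort_values(ascending=False)


# toy frame mimicking the Age/Experience relationship
demo = pd.DataFrame(
    {"Age": [25, 30, 35, 40], "Experience": [1, 6, 11, 16], "Income": [50, 40, 90, 60]}
)
print(high_corr_pairs(demo, threshold=0.9))
```

On the toy frame only the Age/Experience pair clears the 0.9 cutoff, which mirrors what the full correlation matrix above shows for the bank data.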
df.corr()["Personal_Loan"]
df.corr().sort_values(by="Personal_Loan", ascending=False)["Personal_Loan"]
Personal_Loan              1.000000
Income                     0.502462
CCAvg                      0.366889
CD_Account                 0.316355
Mortgage                   0.142095
Securities_Account         0.021954
Online                     0.006278
CreditCard                 0.002802
median_household_income   -0.000318
Longitude                 -0.001711
Latitude                  -0.006482
Age                       -0.007726
Experience                -0.008304
Name: Personal_Loan, dtype: float64
Looking at the correlations with Personal_Loan, none of the variables are very highly correlated with the target; Income (0.50) is the strongest.
# Function to display the correlation matrix graphically as an annotated heatmap
def plot_corr(df, size=12):
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    # annotate each cell with its correlation value
    for (i, j), z in np.ndenumerate(corr):
        ax.text(j, i, "{:0.1f}".format(z), ha="center", va="center")
plot_corr(df)
plt.figure(figsize=(15, 7))
sns.lineplot(x="CCAvg", y="Age", data=df, ci=None)
plt.figure(figsize=(10, 15))
sns.barplot(data=df, x="median_household_income", y="County")
plt.figure(figsize=(10, 8))
sns.barplot(data=df, x="Exp_Bins", y="Experience", hue="Education")
plt.figure(figsize=(10, 8))
sns.barplot(data=df, x="Exp_Bins", y="Experience", hue="Family")
plt.figure(figsize=(10, 8))
sns.barplot(data=df, x="Age_Bins", y="Experience", hue="Education")
plt.figure(figsize=(10, 8))
sns.barplot(data=df, x="Age_Bins", y="Experience", hue="Family")
The model can make wrong predictions in two ways:

1. Predicting a customer will purchase a personal loan when they would not (a false positive).
2. Predicting a customer will not purchase a personal loan when they would (a false negative).

Which loss is greater?

Missing a customer who would have taken a loan (a false negative) costs the bank potential interest income, which matters more here than the wasted marketing effort of targeting an uninterested customer.

How to reduce this loss?

We want to maximize Recall: the greater the Recall, the fewer potential loan customers the model misses.
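Because loan purchasers are the rare class, recall can be traded against precision by moving the classification threshold below the default 0.5. A small sketch on synthetic, imbalanced data (generated with `make_classification`; this is a stand-in, not the bank data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# synthetic imbalanced stand-in for the bank data (~10% positives)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of class 1

# lowering the threshold flags more customers, raising recall at the cost of precision
for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)
    print(threshold, recall_score(y_te, pred), precision_score(y_te, pred))
```

With a lower threshold the model misses fewer would-be purchasers (fewer false negatives), which is the loss we care about here.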
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # classify as 1 when the predicted probability exceeds the threshold
    pred = model.predict(predictors) > threshold

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Data Preparation
# checking for null values
df["median_household_income"].isnull().sum()
795
# Dropping 'median_household_income' because I created it and there are null values
df.drop("median_household_income", axis=1, inplace=True)
# Creating a copy so it is easier to go back and do a different model dropping different variables.
df2 = df.copy()
X = df2.drop("Personal_Loan", axis=1)
y = df2["Personal_Loan"]
# creating dummy variables
X = pd.get_dummies(X, drop_first=True)
# splitting in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# There are different solvers available in Sklearn logistic regression;
# newton-cg is used here
model = LogisticRegression(solver="newton-cg", random_state=1)
lg = model.fit(X_train, y_train)
# predicting on training set
y_pred_train = lg.predict(X_train)
print("Training set performance:")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision:", precision_score(y_train, y_pred_train))
print("Recall:", recall_score(y_train, y_pred_train))
print("F1:", f1_score(y_train, y_pred_train))
Training set performance:
Accuracy: 0.9788571428571429
Precision: 0.9446366782006921
Recall: 0.824773413897281
F1: 0.8806451612903228
# predicting on the test set
y_pred_test = lg.predict(X_test)
print("Test set performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision:", precision_score(y_test, y_pred_test))
print("Recall:", recall_score(y_test, y_pred_test))
print("F1:", f1_score(y_test, y_pred_test))
Test set performance:
Accuracy: 0.9553333333333334
Precision: 0.8867924528301887
Recall: 0.6308724832214765
F1: 0.7372549019607842
Observations
The training and test recall are 82.5% and 63.1% respectively.
Recall on the train and test sets is not comparable.
This shows that the model is not giving a generalized result.
X = df2.drop("Personal_Loan", axis=1)
y = df2["Personal_Loan"]
# creating dummy variables
X = pd.get_dummies(X, drop_first=True)
# adding constant
X = sm.add_constant(X)
# splitting in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(
disp=False
) # setting disp=False will remove the information on number of iterations
print(lg.summary())
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-69-638eec83876e> in <module>
      1 logit = sm.Logit(y_train, X_train.astype(float))
      2 lg = logit.fit(
----> 3     disp=False
      4 )  # setting disp=False will remove the information on number of iterations
...
LinAlgError: Singular matrix
Observations

The fit fails with a singular-matrix error: the design matrix is not full rank, which indicates perfect (or near-perfect) multicollinearity among the predictors. Some variables need to be dropped before the statsmodels Logit model can be fit.
# creating a new copy to reduce multicollinearity
df3 = df2.copy()
# Dropping variable Latitude and Longitude since I created these for the map
df3.drop("Latitude", axis=1, inplace=True)
df3.drop("Longitude", axis=1, inplace=True)
df3.drop("Experience", axis=1, inplace=True)
After running the VIF test the first time, I found that Age and Experience had very high scores, so I dropped Experience.
df3.drop("Age_Bins", axis=1, inplace=True)
df3.drop("Exp_Bins", axis=1, inplace=True)
df3.drop("Inc_Bins", axis=1, inplace=True)
There were also high scores between the bins I created and the original columns, so I am dropping the bin columns first.
df3.drop("City", axis=1, inplace=True)
df3.drop("ZIPCode", axis=1, inplace=True)
After that, there were high scores among City, County, and ZIPCode, so I dropped City and ZIPCode.
X = df3.drop("Personal_Loan", axis=1)
y = df3["Personal_Loan"]
# creating dummy variables
X = pd.get_dummies(X, drop_first=True)
# adding constant
X = sm.add_constant(X)
# splitting in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(
disp=False
) # setting disp=False will remove the information on number of iterations
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 3500
Model: Logit Df Residuals: 3449
Method: MLE Df Model: 50
Date: Sat, 31 Jul 2021 Pseudo R-squ.: 0.6692
Time: 09:10:03 Log-Likelihood: -362.37
converged: False LL-Null: -1095.5
Covariance Type: nonrobust LLR p-value: 4.051e-274
=================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------
const -13.8852 0.925 -15.014 0.000 -15.698 -12.073
Age 0.0046 0.009 0.521 0.602 -0.013 0.022
Income 0.0646 0.004 16.098 0.000 0.057 0.073
CCAvg 0.2495 0.061 4.061 0.000 0.129 0.370
Mortgage 0.0010 0.001 1.286 0.198 -0.001 0.003
Securities_Account -1.0765 0.420 -2.562 0.010 -1.900 -0.253
CD_Account 3.8828 0.458 8.483 0.000 2.986 4.780
Online -0.6575 0.214 -3.078 0.002 -1.076 -0.239
CreditCard -1.1321 0.285 -3.973 0.000 -1.691 -0.574
Family_2 0.0894 0.304 0.294 0.769 -0.506 0.685
Family_3 2.6590 0.332 8.020 0.000 2.009 3.309
Family_4 1.7490 0.322 5.427 0.000 1.117 2.381
Education_2 4.0583 0.357 11.378 0.000 3.359 4.757
Education_3 4.3228 0.355 12.167 0.000 3.626 5.019
County_Butte County -21.8923 1.7e+05 -0.000 1.000 -3.33e+05 3.33e+05
County_Contra Costa County 0.2788 0.842 0.331 0.740 -1.371 1.928
County_El Dorado County -0.7732 1.434 -0.539 0.590 -3.584 2.038
County_Fresno County -1.0976 2.289 -0.480 0.632 -5.583 3.388
County_Humboldt County -1.2518 1.788 -0.700 0.484 -4.756 2.253
County_Imperial County -15.0033 2.81e+04 -0.001 1.000 -5.52e+04 5.52e+04
County_Kern County 1.5255 0.783 1.949 0.051 -0.008 3.059
County_Lake County -18.1981 8.36e+04 -0.000 1.000 -1.64e+05 1.64e+05
County_Los Angeles County -0.0011 0.371 -0.003 0.998 -0.729 0.727
County_Marin County 0.3874 0.913 0.424 0.671 -1.402 2.177
County_Mendocino County -2.3533 4.593 -0.512 0.608 -11.355 6.648
County_Merced County -10.4431 441.561 -0.024 0.981 -875.886 855.000
County_Monterey County -0.1041 0.698 -0.149 0.881 -1.472 1.264
County_Napa County -9.4178 2022.553 -0.005 0.996 -3973.548 3954.712
County_Orange County 0.1550 0.490 0.317 0.752 -0.805 1.115
County_Placer County 1.0965 1.018 1.077 0.282 -0.899 3.092
County_Riverside County 2.1643 0.823 2.629 0.009 0.551 3.778
County_Sacramento County 0.0853 0.595 0.143 0.886 -1.082 1.252
County_San Benito County -14.1809 3748.145 -0.004 0.997 -7360.409 7332.047
County_San Bernardino County -1.0360 1.110 -0.933 0.351 -3.212 1.140
County_San Diego County 0.1006 0.427 0.236 0.814 -0.735 0.937
County_San Francisco County 0.2212 0.531 0.417 0.677 -0.820 1.262
County_San Joaquin County -0.2417 7.341 -0.033 0.974 -14.631 14.147
County_San Luis Obispo County -1.5861 2.216 -0.716 0.474 -5.929 2.757
County_San Mateo County -1.3032 0.662 -1.967 0.049 -2.602 -0.005
County_Santa Barbara County 0.4584 0.633 0.724 0.469 -0.783 1.700
County_Santa Clara County 0.2266 0.421 0.538 0.591 -0.599 1.052
County_Santa Cruz County -0.0094 0.853 -0.011 0.991 -1.681 1.662
County_Shasta County -4.4861 10.022 -0.448 0.654 -24.128 15.156
County_Siskiyou County -41.7476 5.09e+09 -8.2e-09 1.000 -9.98e+09 9.98e+09
County_Solano County 1.0378 1.042 0.996 0.319 -1.004 3.079
County_Sonoma County 1.3138 1.154 1.139 0.255 -0.947 3.575
County_Stanislaus County -12.5303 550.129 -0.023 0.982 -1090.763 1065.702
County_Trinity County -11.7751 1021.519 -0.012 0.991 -2013.915 1990.365
County_Tuolumne County -21.5616 1.64e+05 -0.000 1.000 -3.21e+05 3.21e+05
County_Ventura County 0.1032 0.646 0.160 0.873 -1.163 1.369
County_Yolo County -0.4848 0.767 -0.632 0.527 -1.987 1.018
=================================================================================================
Possibly complete quasi-separation: A fraction 0.17 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Observations
Negative coefficients indicate that the probability of a customer purchasing a loan decreases as the corresponding attribute value increases.
Positive coefficients indicate that the probability of a customer purchasing a loan increases as the corresponding attribute value increases.
p-value of a variable indicates if the variable is significant or not. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.
But these variables might contain multicollinearity, which will affect the p-values.
We will have to remove multicollinearity from the data to get reliable coefficients and p-values.
There are different ways of detecting (or testing for) multi-collinearity; one such way is the Variance Inflation Factor.
Variance Inflation factor: Variance inflation factors measure the inflation in the variances of the regression coefficients estimates due to collinearity that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is "inflated" by the existence of correlation among the predictor variables in the model.
General Rule of thumb: If VIF is 1 then there is no correlation among the kth predictor and the remaining predictor variables, and hence the variance of β̂k is not inflated at all. Whereas if VIF exceeds 5, we say there is moderate VIF and if it is 10 or exceeding 10, it shows signs of high multi-collinearity. But the purpose of the analysis should dictate which threshold to use.
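The manual drop-refit-recheck loop used below can also be automated: repeatedly remove the feature with the highest VIF until everything falls below the chosen threshold. A sketch on toy data (the `drop_high_vif` helper and the `toy` frame are illustrative; in practice the constant column should be excluded from the check):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def drop_high_vif(X, threshold=5.0):
    """Iteratively drop the column with the highest VIF until all fall below threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vif = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vif.max() < threshold:
            break
        # drop the worst offender and recompute on the reduced frame
        X = X.drop(vif.idxmax(), axis=1)
    return X


# toy frame: b is a near-copy of a (like Age vs Experience), c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame(
    {"a": a, "b": a + rng.normal(scale=0.01, size=200), "c": rng.normal(size=200)}
)
print(drop_high_vif(toy).columns.tolist())
```

One of the near-duplicate columns is eliminated while the independent column survives, which is exactly the effect the manual drops below aim for.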
vif_series = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: const 36.209113 Age 1.026204 Income 1.913727 CCAvg 1.755454 Mortgage 1.060962 Securities_Account 1.157801 CD_Account 1.373245 Online 1.055760 CreditCard 1.125206 Family_2 1.420281 Family_3 1.396617 Family_4 1.441933 Education_2 1.302912 Education_3 1.264973 County_Butte County 1.033779 County_Contra Costa County 1.131617 County_El Dorado County 1.028104 County_Fresno County 1.032622 County_Humboldt County 1.059178 County_Imperial County 1.006696 County_Kern County 1.088149 County_Lake County 1.013082 County_Los Angeles County 2.322558 County_Marin County 1.085842 County_Mendocino County 1.023649 County_Merced County 1.009846 County_Monterey County 1.202414 County_Napa County 1.010066 County_Orange County 1.518167 County_Placer County 1.043319 County_Riverside County 1.083379 County_Sacramento County 1.303141 County_San Benito County 1.028236 County_San Bernardino County 1.169006 County_San Diego County 1.759683 County_San Francisco County 1.406837 County_San Joaquin County 1.015351 County_San Luis Obispo County 1.049200 County_San Mateo County 1.317495 County_Santa Barbara County 1.228196 County_Santa Clara County 1.777333 County_Santa Cruz County 1.117850 County_Shasta County 1.020468 County_Siskiyou County 1.013893 County_Solano County 1.063769 County_Sonoma County 1.058558 County_Stanislaus County 1.026127 County_Trinity County 1.009929 County_Tuolumne County 1.011877 County_Ventura County 1.180697 County_Yolo County 1.195133 dtype: float64
With the multicollinearity reduced, I want to try the df3 dataframe on the Logistic Regression (with the Sklearn library).
X = df3.drop("Personal_Loan", axis=1)
Y = df3["Personal_Loan"]
# creating dummy variables
X = pd.get_dummies(X, drop_first=True)
# splitting in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
# There are different solvers available in Sklearn logistic regression;
# newton-cg is used here
model = LogisticRegression(solver="newton-cg", random_state=1)
lg = model.fit(X_train, y_train)
# predicting on training set
y_pred_train = lg.predict(X_train)
print("Training set performance:")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision:", precision_score(y_train, y_pred_train))
print("Recall:", recall_score(y_train, y_pred_train))
print("F1:", f1_score(y_train, y_pred_train))
Training set performance:
Accuracy: 0.9625714285714285
Precision: 0.890625
Recall: 0.6888217522658611
F1: 0.776831345826235
# predicting on the test set
y_pred_test = lg.predict(X_test)
print("Test set performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision:", precision_score(y_test, y_pred_test))
print("Recall:", recall_score(y_test, y_pred_test))
print("F1:", f1_score(y_test, y_pred_test))
Test set performance:
Accuracy: 0.954
Precision: 0.9081632653061225
Recall: 0.5973154362416108
F1: 0.7206477732793521
Observations
The training and test recall are 68.9% and 59.7% respectively.
Recall on the train and test sets is comparable.
This shows that the model is giving a generalized result.
Since none of the remaining variables exhibits high multicollinearity, the coefficients and p-values in the summary should now be reliable, and we can start removing insignificant features. To begin, I am going to remove the 4 counties that had a p-value of 1.000.
X_train1 = X_train.drop(
[
"County_Tuolumne County",
"County_Siskiyou County",
"County_Lake County",
"County_Butte County",
],
axis=1,
)
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(disp=False)
print(lg1.summary())
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-85-5227cc186a33> in <module>
      1 logit1 = sm.Logit(y_train, X_train1.astype(float))
----> 2 lg1 = logit1.fit(disp=False)
      3 print(lg1.summary())
...
LinAlgError: Singular matrix
Hmm, I have multicollinearity problems again.
vif_series = pd.Series(
[variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])],
index=X_train1.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: Age 8.936005 Income 6.321509 CCAvg 3.924280 Mortgage 1.385515 Securities_Account 1.278596 CD_Account 1.427649 Online 2.450954 CreditCard 1.568050 Family_2 1.839717 Family_3 1.672866 Family_4 1.767683 Education_2 1.712631 Education_3 1.721033 County_Contra Costa County 1.106053 County_El Dorado County 1.024077 County_Fresno County 1.027945 County_Humboldt County 1.045321 County_Imperial County 1.006319 County_Kern County 1.070583 County_Los Angeles County 2.459142 County_Marin County 1.076028 County_Mendocino County 1.021045 County_Merced County 1.007638 County_Monterey County 1.178117 County_Napa County 1.006273 County_Orange County 1.446434 County_Placer County 1.035261 County_Riverside County 1.064665 County_Sacramento County 1.246417 County_San Benito County 1.020784 County_San Bernardino County 1.140195 County_San Diego County 1.695394 County_San Francisco County 1.352123 County_San Joaquin County 1.013789 County_San Luis Obispo County 1.042369 County_San Mateo County 1.262111 County_Santa Barbara County 1.189434 County_Santa Clara County 1.714886 County_Santa Cruz County 1.090563 County_Shasta County 1.016466 County_Solano County 1.048724 County_Sonoma County 1.056365 County_Stanislaus County 1.019422 County_Trinity County 1.008441 County_Ventura County 1.164472 County_Yolo County 1.162851 dtype: float64
Here we can see that Age has the highest VIF score (8.94), well above 5; Income (6.32) is also above 5.
X_train1 = X_train.drop(
[
"Age",
],
axis=1,
)
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(disp=False)
print(lg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 3500
Model: Logit Df Residuals: 3451
Method: MLE Df Model: 48
Date: Sat, 31 Jul 2021 Pseudo R-squ.: 0.2444
Time: 09:10:25 Log-Likelihood: -827.73
converged: False LL-Null: -1095.5
Covariance Type: nonrobust LLR p-value: 1.547e-83
=================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------
Income 0.0173 0.002 10.533 0.000 0.014 0.021
CCAvg 0.1009 0.041 2.477 0.013 0.021 0.181
Mortgage -4.782e-05 0.001 -0.086 0.932 -0.001 0.001
Securities_Account -2.0776 0.293 -7.090 0.000 -2.652 -1.503
CD_Account 4.8485 0.303 15.980 0.000 4.254 5.443
Online -1.7560 0.139 -12.596 0.000 -2.029 -1.483
CreditCard -2.0962 0.208 -10.076 0.000 -2.504 -1.688
Family_2 -1.5988 0.182 -8.771 0.000 -1.956 -1.242
Family_3 -0.4730 0.166 -2.845 0.004 -0.799 -0.147
Family_4 -0.7476 0.172 -4.343 0.000 -1.085 -0.410
Education_2 0.3166 0.156 2.024 0.043 0.010 0.623
Education_3 0.3282 0.147 2.231 0.026 0.040 0.617
County_Butte County -18.7473 5313.954 -0.004 0.997 -1.04e+04 1.04e+04
County_Contra Costa County -2.2725 0.500 -4.544 0.000 -3.253 -1.292
County_El Dorado County -2.6294 1.399 -1.880 0.060 -5.371 0.112
County_Fresno County -3.1295 1.140 -2.746 0.006 -5.363 -0.896
County_Humboldt County -2.8490 1.073 -2.655 0.008 -4.952 -0.746
County_Imperial County -28.1673 3.1e+06 -9.07e-06 1.000 -6.08e+06 6.08e+06
County_Kern County -1.7671 0.508 -3.476 0.001 -2.763 -0.771
County_Lake County -30.6803 3.85e+06 -7.97e-06 1.000 -7.55e+06 7.55e+06
County_Los Angeles County -2.7602 0.192 -14.340 0.000 -3.137 -2.383
County_Marin County -2.6824 0.616 -4.354 0.000 -3.890 -1.475
County_Mendocino County -3.1282 1.261 -2.481 0.013 -5.599 -0.657
County_Merced County -20.0028 6929.367 -0.003 0.998 -1.36e+04 1.36e+04
County_Monterey County -2.5066 0.401 -6.254 0.000 -3.292 -1.721
County_Napa County -36.8338 4.69e+07 -7.85e-07 1.000 -9.2e+07 9.2e+07
County_Orange County -3.0760 0.303 -10.137 0.000 -3.671 -2.481
County_Placer County -1.9471 0.899 -2.165 0.030 -3.710 -0.184
County_Riverside County -2.7409 0.683 -4.012 0.000 -4.080 -1.402
County_Sacramento County -2.7550 0.368 -7.494 0.000 -3.476 -2.035
County_San Benito County -19.8590 1.05e+04 -0.002 0.998 -2.06e+04 2.05e+04
County_San Bernardino County -4.9009 1.051 -4.662 0.000 -6.961 -2.840
County_San Diego County -2.5559 0.230 -11.131 0.000 -3.006 -2.106
County_San Francisco County -2.9823 0.337 -8.853 0.000 -3.643 -2.322
County_San Joaquin County -1.3757 1.266 -1.087 0.277 -3.857 1.105
County_San Luis Obispo County -2.8139 0.851 -3.306 0.001 -4.482 -1.146
County_San Mateo County -3.8750 0.452 -8.576 0.000 -4.761 -2.989
County_Santa Barbara County -2.6086 0.406 -6.421 0.000 -3.405 -1.812
County_Santa Clara County -2.6105 0.238 -10.990 0.000 -3.076 -2.145
County_Santa Cruz County -2.6793 0.546 -4.908 0.000 -3.749 -1.609
County_Shasta County -3.5358 1.659 -2.132 0.033 -6.786 -0.285
County_Siskiyou County -20.5180 2.15e+04 -0.001 0.999 -4.22e+04 4.22e+04
County_Solano County -2.8396 0.819 -3.469 0.001 -4.444 -1.235
County_Sonoma County -1.3676 0.722 -1.894 0.058 -2.783 0.048
County_Stanislaus County -20.4985 9020.548 -0.002 0.998 -1.77e+04 1.77e+04
County_Trinity County -16.1658 1393.621 -0.012 0.991 -2747.612 2715.281
County_Tuolumne County -20.3609 2.83e+04 -0.001 0.999 -5.54e+04 5.54e+04
County_Ventura County -2.6373 0.455 -5.795 0.000 -3.529 -1.745
County_Yolo County -3.2489 0.565 -5.748 0.000 -4.357 -2.141
=================================================================================================
vif_series = pd.Series(
[variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])],
index=X_train1.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: Income 6.072085 CCAvg 3.926404 Mortgage 1.387149 Securities_Account 1.273257 CD_Account 1.415578 Online 2.352687 CreditCard 1.535435 Family_2 1.772369 Family_3 1.597993 Family_4 1.720229 Education_2 1.660475 Education_3 1.618522 County_Butte County 1.021538 County_Contra Costa County 1.084689 County_El Dorado County 1.020408 County_Fresno County 1.021808 County_Humboldt County 1.032392 County_Imperial County 1.005384 County_Kern County 1.051344 County_Lake County 1.009358 County_Los Angeles County 2.054082 County_Marin County 1.056114 County_Mendocino County 1.016735 County_Merced County 1.006955 County_Monterey County 1.134874 County_Napa County 1.004927 County_Orange County 1.322162 County_Placer County 1.026432 County_Riverside County 1.042304 County_Sacramento County 1.176115 County_San Benito County 1.020926 County_San Bernardino County 1.093543 County_San Diego County 1.523292 County_San Francisco County 1.239989 County_San Joaquin County 1.011318 County_San Luis Obispo County 1.032106 County_San Mateo County 1.174668 County_Santa Barbara County 1.133585 County_Santa Clara County 1.551532 County_Santa Cruz County 1.067146 County_Shasta County 1.013623 County_Siskiyou County 1.010453 County_Solano County 1.036458 County_Sonoma County 1.041163 County_Stanislaus County 1.016799 County_Trinity County 1.007336 County_Tuolumne County 1.006667 County_Ventura County 1.118208 County_Yolo County 1.121842 dtype: float64
X_train2 = X_train1.drop(
[
"Income",
],
axis=1,
)
logit2 = sm.Logit(y_train, X_train2.astype(float))
lg2 = logit2.fit(disp=False)
print(lg2.summary())
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-91-346717d7fd92> in <module>
      1 logit2 = sm.Logit(y_train, X_train2.astype(float))
----> 2 lg2 = logit2.fit(disp=False)
      3 print(lg2.summary())
...
LinAlgError: Singular matrix
vif_series = pd.Series(
[variance_inflation_factor(X_train2.values, i) for i in range(X_train2.shape[1])],
index=X_train2.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: CCAvg 2.171652 Mortgage 1.336767 Securities_Account 1.272333 CD_Account 1.406264 Online 2.318033 CreditCard 1.534374 Family_2 1.745180 Family_3 1.594173 Family_4 1.719900 Education_2 1.658458 Education_3 1.617244 County_Butte County 1.020916 County_Contra Costa County 1.073759 County_El Dorado County 1.017994 County_Fresno County 1.019608 County_Humboldt County 1.030327 County_Imperial County 1.005337 County_Kern County 1.047412 County_Lake County 1.009210 County_Los Angeles County 1.906090 County_Marin County 1.051518 County_Mendocino County 1.011811 County_Merced County 1.001601 County_Monterey County 1.123370 County_Napa County 1.004823 County_Orange County 1.279979 County_Placer County 1.022235 County_Riverside County 1.038077 County_Sacramento County 1.161371 County_San Benito County 1.020065 County_San Bernardino County 1.078721 County_San Diego County 1.462459 County_San Francisco County 1.218363 County_San Joaquin County 1.010181 County_San Luis Obispo County 1.028325 County_San Mateo County 1.152013 County_Santa Barbara County 1.118418 County_Santa Clara County 1.478172 County_Santa Cruz County 1.059674 County_Shasta County 1.012433 County_Siskiyou County 1.009995 County_Solano County 1.035361 County_Sonoma County 1.040373 County_Stanislaus County 1.016679 County_Trinity County 1.006726 County_Tuolumne County 1.006642 County_Ventura County 1.107293 County_Yolo County 1.112656 dtype: float64
X_train2 = X_train1.drop(
[
"Online",
],
axis=1,
)
logit2 = sm.Logit(y_train, X_train2.astype(float))
lg2 = logit2.fit(disp=False)
print(lg2.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 3500
Model: Logit Df Residuals: 3452
Method: MLE Df Model: 47
Date: Sat, 31 Jul 2021 Pseudo R-squ.: 0.1590
Time: 09:10:32 Log-Likelihood: -921.29
converged: False LL-Null: -1095.5
Covariance Type: nonrobust LLR p-value: 1.291e-47
=================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------
Income 0.0124 0.002 8.198 0.000 0.009 0.015
CCAvg 0.1002 0.039 2.554 0.011 0.023 0.177
Mortgage 6.492e-05 0.001 0.121 0.904 -0.001 0.001
Securities_Account -1.9361 0.271 -7.142 0.000 -2.467 -1.405
CD_Account 3.9552 0.272 14.535 0.000 3.422 4.488
CreditCard -1.9556 0.196 -9.959 0.000 -2.340 -1.571
Family_2 -1.6990 0.175 -9.721 0.000 -2.042 -1.356
Family_3 -0.7198 0.157 -4.578 0.000 -1.028 -0.412
Family_4 -0.9115 0.161 -5.648 0.000 -1.228 -0.595
Education_2 -0.0231 0.145 -0.159 0.874 -0.307 0.261
Education_3 0.0431 0.138 0.313 0.755 -0.227 0.313
County_Butte County -20.9663 1.12e+04 -0.002 0.999 -2.19e+04 2.19e+04
County_Contra Costa County -2.5722 0.468 -5.493 0.000 -3.490 -1.654
County_El Dorado County -2.9667 1.321 -2.246 0.025 -5.555 -0.378
County_Fresno County -3.1615 1.061 -2.980 0.003 -5.241 -1.082
County_Humboldt County -3.0604 1.048 -2.920 0.003 -5.114 -1.006
County_Imperial County -12.0852 565.137 -0.021 0.983 -1119.733 1095.563
County_Kern County -2.0156 0.499 -4.043 0.000 -2.993 -1.038
County_Lake County -35.2006 2.33e+07 -1.51e-06 1.000 -4.56e+07 4.56e+07
County_Los Angeles County -2.8588 0.181 -15.801 0.000 -3.213 -2.504
County_Marin County -2.3909 0.586 -4.079 0.000 -3.540 -1.242
County_Mendocino County -3.0831 1.116 -2.762 0.006 -5.271 -0.895
County_Merced County -18.3371 4453.729 -0.004 0.997 -8747.486 8710.812
County_Monterey County -2.4788 0.382 -6.489 0.000 -3.228 -1.730
County_Napa County -21.2073 1.84e+04 -0.001 0.999 -3.62e+04 3.61e+04
County_Orange County -3.1444 0.292 -10.756 0.000 -3.717 -2.571
County_Placer County -2.1944 0.829 -2.648 0.008 -3.819 -0.570
County_Riverside County -2.8076 0.649 -4.323 0.000 -4.081 -1.535
County_Sacramento County -2.8763 0.351 -8.193 0.000 -3.564 -2.188
County_San Benito County -26.0255 1.38e+05 -0.000 1.000 -2.71e+05 2.71e+05
County_San Bernardino County -4.7555 1.027 -4.632 0.000 -6.768 -2.743
County_San Diego County -2.5879 0.217 -11.903 0.000 -3.014 -2.162
County_San Francisco County -2.9468 0.318 -9.266 0.000 -3.570 -2.323
County_San Joaquin County -2.0757 1.214 -1.709 0.087 -4.456 0.304
County_San Luis Obispo County -2.2871 0.751 -3.044 0.002 -3.760 -0.815
County_San Mateo County -3.5619 0.429 -8.299 0.000 -4.403 -2.721
County_Santa Barbara County -2.6276 0.383 -6.857 0.000 -3.379 -1.877
County_Santa Clara County -2.5941 0.222 -11.690 0.000 -3.029 -2.159
County_Santa Cruz County -2.5824 0.513 -5.037 0.000 -3.587 -1.578
County_Shasta County -3.4015 1.517 -2.242 0.025 -6.376 -0.428
County_Siskiyou County -2.6804 1.929 -1.390 0.165 -6.460 1.100
County_Solano County -3.0711 0.822 -3.736 0.000 -4.682 -1.460
County_Sonoma County -1.7168 0.629 -2.727 0.006 -2.951 -0.483
County_Stanislaus County -19.2505 6531.641 -0.003 0.998 -1.28e+04 1.28e+04
County_Trinity County -18.4914 5483.322 -0.003 0.997 -1.08e+04 1.07e+04
County_Tuolumne County -28.8508 1.29e+06 -2.24e-05 1.000 -2.53e+06 2.53e+06
County_Ventura County -2.5598 0.425 -6.017 0.000 -3.394 -1.726
County_Yolo County -3.4519 0.546 -6.318 0.000 -4.523 -2.381
=================================================================================================
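The logit coefficients above are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret for the marketing team. A small sketch using two coefficients read off the summary above:

```python
import numpy as np

# Coefficients taken from the lg2 summary above (log-odds scale)
coefs = {"CD_Account": 3.9552, "CreditCard": -1.9556}
odds_ratios = {k: float(np.exp(v)) for k, v in coefs.items()}

# CD_Account: holding a CD account multiplies the odds of taking the loan ~52x;
# CreditCard: holding a bank credit card multiplies the odds by ~0.14.
print(odds_ratios)
```

Note that with `converged: False` in the summary, these point estimates should be treated cautiously until a model fits cleanly.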
X_train3 = X_train2.drop(
[
"County_Tuolumne County",
"County_San Benito County",
"County_Lake County",
],
axis=1,
)
logit3 = sm.Logit(y_train, X_train3.astype(float))
lg3 = logit3.fit(disp=False)
print(lg3.summary())
Dropping the three sparse county dummies does not fix the problem: `Logit.fit` again raises `LinAlgError: Singular matrix` — the remaining columns are still rank-deficient. Traceback (identical to the one above) omitted.
vif_series = pd.Series(
[variance_inflation_factor(X_train3.values, i) for i in range(X_train3.shape[1])],
index=X_train3.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: Income 5.972958 CCAvg 3.925342 Mortgage 1.383419 Securities_Account 1.270937 CD_Account 1.384173 CreditCard 1.530634 Family_2 1.753793 Family_3 1.577504 Family_4 1.688133 Education_2 1.637637 Education_3 1.608395 County_Butte County 1.015615 County_Contra Costa County 1.069878 County_El Dorado County 1.018845 County_Fresno County 1.019807 County_Humboldt County 1.029027 County_Imperial County 1.004026 County_Kern County 1.043758 County_Los Angeles County 1.952123 County_Marin County 1.052912 County_Mendocino County 1.015574 County_Merced County 1.006709 County_Monterey County 1.123508 County_Napa County 1.004019 County_Orange County 1.294043 County_Placer County 1.022886 County_Riverside County 1.038579 County_Sacramento County 1.151663 County_San Bernardino County 1.082628 County_San Diego County 1.477075 County_San Francisco County 1.216569 County_San Joaquin County 1.009050 County_San Luis Obispo County 1.030204 County_San Mateo County 1.165619 County_Santa Barbara County 1.121654 County_Santa Clara County 1.506802 County_Santa Cruz County 1.063263 County_Shasta County 1.012116 County_Siskiyou County 1.007139 County_Solano County 1.033071 County_Sonoma County 1.038125 County_Stanislaus County 1.015768 County_Trinity County 1.007279 County_Ventura County 1.106240 County_Yolo County 1.103357 dtype: float64
X_train3 = X_train2.drop(
[
"Income",
],
axis=1,
)
logit3 = sm.Logit(y_train, X_train3.astype(float))
lg3 = logit3.fit(disp=False)
print(lg3.summary())
Dropping `Income` instead still leaves a singular design matrix: `Logit.fit` raises `LinAlgError: Singular matrix` a third time (traceback omitted).
vif_series = pd.Series(
[variance_inflation_factor(X_train3.values, i) for i in range(X_train3.shape[1])],
index=X_train3.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: CCAvg 2.152986 Mortgage 1.336222 Securities_Account 1.272205 CD_Account 1.373352 CreditCard 1.534139 Family_2 1.726247 Family_3 1.573291 Family_4 1.694734 Education_2 1.647720 Education_3 1.612918 County_Butte County 1.014736 County_Contra Costa County 1.057030 County_El Dorado County 1.016267 County_Fresno County 1.017362 County_Humboldt County 1.026839 County_Imperial County 1.003974 County_Kern County 1.039274 County_Lake County 1.008325 County_Los Angeles County 1.793411 County_Marin County 1.048517 County_Mendocino County 1.010269 County_Merced County 1.001569 County_Monterey County 1.111825 County_Napa County 1.004025 County_Orange County 1.249647 County_Placer County 1.018102 County_Riverside County 1.034253 County_Sacramento County 1.135226 County_San Benito County 1.016521 County_San Bernardino County 1.066135 County_San Diego County 1.412769 County_San Francisco County 1.194130 County_San Joaquin County 1.007669 County_San Luis Obispo County 1.026271 County_San Mateo County 1.143520 County_Santa Barbara County 1.105874 County_Santa Clara County 1.429975 County_Santa Cruz County 1.056055 County_Shasta County 1.010743 County_Siskiyou County 1.006490 County_Solano County 1.032124 County_Sonoma County 1.037706 County_Stanislaus County 1.015814 County_Trinity County 1.006704 County_Tuolumne County 1.005216 County_Ventura County 1.094809 County_Yolo County 1.092933 dtype: float64
X_train3 = X_train2.drop(
[
"CCAvg",
],
axis=1,
)
logit3 = sm.Logit(y_train, X_train3.astype(float))
lg3 = logit3.fit(disp=False)
print(lg3.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 3500
Model: Logit Df Residuals: 3453
Method: MLE Df Model: 46
Date: Sat, 31 Jul 2021 Pseudo R-squ.: 0.1563
Time: 09:10:40 Log-Likelihood: -924.23
converged: False LL-Null: -1095.5
Covariance Type: nonrobust LLR p-value: 6.045e-47
=================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------
Income 0.0149 0.001 12.852 0.000 0.013 0.017
Mortgage -1.35e-05 0.001 -0.025 0.980 -0.001 0.001
Securities_Account -1.9269 0.271 -7.121 0.000 -2.457 -1.397
CD_Account 3.9741 0.272 14.595 0.000 3.440 4.508
CreditCard -1.9517 0.196 -9.932 0.000 -2.337 -1.567
Family_2 -1.6748 0.174 -9.643 0.000 -2.015 -1.334
Family_3 -0.7125 0.157 -4.536 0.000 -1.020 -0.405
Family_4 -0.9059 0.161 -5.612 0.000 -1.222 -0.590
Education_2 -0.0153 0.145 -0.105 0.916 -0.300 0.269
Education_3 0.0456 0.137 0.332 0.740 -0.224 0.315
County_Butte County -22.9056 3.02e+04 -0.001 0.999 -5.92e+04 5.91e+04
County_Contra Costa County -2.5756 0.470 -5.477 0.000 -3.497 -1.654
County_El Dorado County -2.9026 1.293 -2.245 0.025 -5.437 -0.368
County_Fresno County -3.1792 1.059 -3.001 0.003 -5.255 -1.103
County_Humboldt County -3.0287 1.048 -2.891 0.004 -5.082 -0.975
County_Imperial County -23.2313 1.44e+05 -0.000 1.000 -2.82e+05 2.82e+05
County_Kern County -1.9863 0.494 -4.017 0.000 -2.955 -1.017
County_Lake County -23.5166 6.52e+04 -0.000 1.000 -1.28e+05 1.28e+05
County_Los Angeles County -2.8635 0.181 -15.813 0.000 -3.218 -2.509
County_Marin County -2.3890 0.587 -4.070 0.000 -3.539 -1.239
County_Mendocino County -3.1707 1.120 -2.832 0.005 -5.365 -0.976
County_Merced County -28.6952 7.07e+05 -4.06e-05 1.000 -1.39e+06 1.39e+06
County_Monterey County -2.4527 0.382 -6.423 0.000 -3.201 -1.704
County_Napa County -18.4462 4938.004 -0.004 0.997 -9696.756 9659.863
County_Orange County -3.1377 0.292 -10.763 0.000 -3.709 -2.566
County_Placer County -2.2935 0.838 -2.737 0.006 -3.936 -0.651
County_Riverside County -2.7843 0.644 -4.321 0.000 -4.047 -1.521
County_Sacramento County -2.8537 0.349 -8.173 0.000 -3.538 -2.169
County_San Benito County -12.5362 170.828 -0.073 0.941 -347.353 322.280
County_San Bernardino County -4.7501 1.024 -4.639 0.000 -6.757 -2.743
County_San Diego County -2.5795 0.217 -11.897 0.000 -3.004 -2.155
County_San Francisco County -2.9334 0.317 -9.245 0.000 -3.555 -2.312
County_San Joaquin County -2.0978 1.218 -1.723 0.085 -4.484 0.289
County_San Luis Obispo County -2.2651 0.759 -2.985 0.003 -3.753 -0.778
County_San Mateo County -3.5485 0.428 -8.288 0.000 -4.388 -2.709
County_Santa Barbara County -2.6052 0.382 -6.818 0.000 -3.354 -1.856
County_Santa Clara County -2.6050 0.222 -11.745 0.000 -3.040 -2.170
County_Santa Cruz County -2.5685 0.515 -4.988 0.000 -3.578 -1.559
County_Shasta County -3.2760 1.414 -2.317 0.020 -6.047 -0.505
County_Siskiyou County -11.3881 126.761 -0.090 0.928 -259.835 237.059
County_Solano County -3.0196 0.817 -3.695 0.000 -4.621 -1.418
County_Sonoma County -1.6156 0.615 -2.625 0.009 -2.822 -0.409
County_Stanislaus County -20.4455 1.19e+04 -0.002 0.999 -2.33e+04 2.32e+04
County_Trinity County -22.2116 3.53e+04 -0.001 0.999 -6.92e+04 6.92e+04
County_Tuolumne County -22.7094 5.76e+04 -0.000 1.000 -1.13e+05 1.13e+05
County_Ventura County -2.5204 0.418 -6.027 0.000 -3.340 -1.701
County_Yolo County -3.4481 0.549 -6.283 0.000 -4.524 -2.372
=================================================================================================
Rather than continuing to drop variables one at a time, it makes more sense to start over from the original `df`, rebuild the dummy-encoded dataframe, and re-split the data. Given the time remaining, I will move on to the decision tree model using that fresh copy.
df4 = df.copy()
df4.drop(["ZIPCode", "City", "Longitude", "Latitude"], axis=1, inplace=True)
df4 = pd.get_dummies(df4)
df4.head(10)
`df4.head(10)` confirms the expected layout (69 columns, too wide to reproduce legibly here): the numeric columns Age, Experience, Income, CCAvg, and Mortgage; the binary flags Personal_Loan, Securities_Account, CD_Account, Online, and CreditCard; and one-hot dummy columns for Family (1-4), Education (1-3), the Age/Experience/Income bins, and the 38 County values.
X = df4.drop("Personal_Loan", axis=1)
y = df4["Personal_Loan"]  # select rather than pop, so df4 is left intact
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1
)
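One caveat with this split: only 480 of 5,000 customers are positives (~9.6%), and an unstratified split can shift that balance between train and test. `train_test_split` accepts a `stratify` argument to preserve it; a sketch on synthetic labels (the notebook's `X`/`y` are not rebuilt here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
y = (rng.random(5000) < 0.096).astype(int)   # ~9.6% positives, like Personal_Loan
X = rng.normal(size=(5000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
# The positive rate is preserved (to within rounding) in both halves:
print(y_tr.mean(), y_te.mean())
```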
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

dTree = DecisionTreeClassifier(criterion="gini", random_state=1)
dTree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
print("Accuracy on training set : ", dTree.score(X_train, y_train))
print("Accuracy on test set : ", dTree.score(X_test, y_test))
Accuracy on training set :  1.0
Accuracy on test set :  0.976
# Checking number of positives
y.sum(axis=0)
480
What does a bank want?
Which loss is greater?
A missed prospective borrower (false negative) costs the bank the interest income from a loan, while contacting an uninterested customer (false positive) costs only the campaign outreach. Since the costlier mistake is missing likely buyers, Recall is a better evaluation metric here than accuracy.
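To make the trade-off concrete, here is recall computed by hand from a confusion matrix — the counts are purely illustrative, not taken from this model:

```python
# Confusion-matrix counts (illustrative): rows = actual, columns = predicted
tn, fp = 1320, 30      # actual No:  correctly skipped / needlessly contacted
fn, tp = 21, 129       # actual Yes: missed buyers / correctly targeted

recall = tp / (tp + fn)                       # share of real buyers we reach
precision = tp / (tp + fp)                    # share of contacts who buy
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(recall, 3), round(precision, 3), round(accuracy, 3))
```

Accuracy looks high because negatives dominate, but the 21 false negatives are the loans the campaign never gets to make — exactly what recall penalizes.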
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual):
    """
    model    : classifier used to predict on the global X_test
    y_actual : ground truth labels for X_test
    """
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
## Function to calculate recall score
def get_recall_score(model):
    """
    model : classifier used to predict on the global X_train / X_test
    """
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set : ", metrics.recall_score(y_train, pred_train))
    print("Recall on test set : ", metrics.recall_score(y_test, pred_test))
make_confusion_matrix(dTree, y_test)
# Recall on train and test
get_recall_score(dTree)
Recall on training set :  1.0
Recall on test set :  0.8590604026845637
feature_names = list(X.columns)
print(feature_names)
['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'Family_1', 'Family_2', 'Family_3', 'Family_4', 'Education_1', 'Education_2', 'Education_3', 'Age_Bins_Age_20to29', 'Age_Bins_Age_30to39', 'Age_Bins_Age_40to49', 'Age_Bins_Age_50to59', 'Age_Bins_Age_60to69', 'Exp_Bins_Exp_0to10', 'Exp_Bins_Exp_11to20', 'Exp_Bins_Exp_21to30', 'Exp_Bins_Exp_30toPlus', 'Inc_Bins_Inc_0to25K', 'Inc_Bins_Inc_100K_Plus', 'Inc_Bins_Inc_26to50K', 'Inc_Bins_Inc_51to75K', 'Inc_Bins_Inc_76to100K', 'County_Alameda County', 'County_Butte County', 'County_Contra Costa County', 'County_El Dorado County', 'County_Fresno County', 'County_Humboldt County', 'County_Imperial County', 'County_Kern County', 'County_Lake County', 'County_Los Angeles County', 'County_Marin County', 'County_Mendocino County', 'County_Merced County', 'County_Monterey County', 'County_Napa County', 'County_Orange County', 'County_Placer County', 'County_Riverside County', 'County_Sacramento County', 'County_San Benito County', 'County_San Bernardino County', 'County_San Diego County', 'County_San Francisco County', 'County_San Joaquin County', 'County_San Luis Obispo County', 'County_San Mateo County', 'County_Santa Barbara County', 'County_Santa Clara County', 'County_Santa Cruz County', 'County_Shasta County', 'County_Siskiyou County', 'County_Solano County', 'County_Sonoma County', 'County_Stanislaus County', 'County_Trinity County', 'County_Tuolumne County', 'County_Ventura County', 'County_Yolo County']
plt.figure(figsize=(20, 30))
tree.plot_tree(
dTree,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(dTree, feature_names=feature_names, show_weights=True))
(The `export_text` output for the unpruned tree is long and was garbled in export; only the headline structure is kept here. The root split is Income <= 116.50; below it the main splits are on CCAvg <= 2.95, Education_1, Family_3/Family_4, and CD_Account. On the high-income side, Income > 116.50 with Education_1 <= 0.50 isolates 222 loan takers as a pure leaf; Income > 116.50, Education_1 > 0.50, Family_3 > 0.50 isolates another 33, and Family_4 > 0.50 captures 14 more. The lower branches keep splitting on individual county dummies and age/experience bins down to leaves of one or two samples — a clear sign the unpruned tree is overfitting.)
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
dTree.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                                 Imp
Education_1                 0.400952
Income                      0.309494
Family_3                    0.097406
Family_4                    0.054470
CCAvg                       0.049703
CD_Account                  0.025711
Age                         0.012351
Experience                  0.009061
Exp_Bins_Exp_21to30         0.005571
Inc_Bins_Inc_100K_Plus      0.004767
Education_2                 0.003849
Exp_Bins_Exp_11to20         0.003575
County_Riverside County     0.003544
Mortgage                    0.003014
Age_Bins_Age_50to59         0.002224
Family_1                    0.002224
County_Ventura County       0.002224
Family_2                    0.001668
Age_Bins_Age_30to39         0.001668
County_San Francisco County 0.001608
County_Alameda County       0.001472
County_Santa Barbara County 0.001422
Education_3                 0.001335
Online                      0.000561
County_Los Angeles County   0.000123
(all remaining features have importance 0.000000)
importances = dTree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 50))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
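As a cross-check on the Gini-importance definition used above, feature_importances_ can be recomputed by hand from the fitted tree_ structure: it is the normalized total (weighted) impurity decrease credited to each split feature. A minimal, self-contained sketch on a toy dataset (not the bank data), using only scikit-learn's public tree_ attributes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label simply equals the second feature
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
y = np.array([0, 1, 0, 1, 0, 1])

clf = DecisionTreeClassifier(random_state=1).fit(X, y)
t = clf.tree_

# Credit each internal node's weighted impurity decrease to its split feature
importances = np.zeros(X.shape[1])
n = t.weighted_n_node_samples
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node, no split
        continue
    decrease = (
        n[node] * t.impurity[node]
        - n[left] * t.impurity[left]
        - n[right] * t.impurity[right]
    )
    importances[t.feature[node]] += decrease

importances /= importances.sum()  # normalize so importances sum to 1
print(importances)  # matches clf.feature_importances_
```

The hand-computed values agree with clf.feature_importances_, which is why features that are never chosen for a split show an importance of exactly zero in the tables above.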
dTree1 = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=1)
dTree1.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, random_state=1)
make_confusion_matrix(dTree1, y_test)
# Accuracy on train and test
print("Accuracy on training set : ", dTree1.score(X_train, y_train))
print("Accuracy on test set : ", dTree1.score(X_test, y_test))
# Recall on train and test
get_recall_score(dTree1)
Accuracy on training set :  0.9782857142857143
Accuracy on test set :  0.9646666666666667
Recall on training set :  0.770392749244713
Recall on test set :  0.6442953020134228
plt.figure(figsize=(15, 10))
tree.plot_tree(
dTree1,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(dTree1, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- weights: [79.00, 10.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- weights: [117.00, 15.00] class: 0
|   |   |--- Income > 92.50
|   |   |   |--- weights: [45.00, 37.00] class: 0
|--- Income > 116.50
|   |--- Education_1 <= 0.50
|   |   |--- weights: [0.00, 222.00] class: 1
|   |--- Education_1 > 0.50
|   |   |--- Family_3 <= 0.50
|   |   |   |--- weights: [375.00, 14.00] class: 0
|   |   |--- Family_3 > 0.50
|   |   |   |--- weights: [0.00, 33.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
dTree1.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                                    Imp
Education_1                    0.471323
Income                         0.366211
Family_3                       0.115989
CCAvg                          0.046476
(all remaining features have importance 0.000000)
importances = dTree1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 50))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(1, 10),
"min_samples_leaf": [1, 2, 5, 7, 10, 15, 20],
"max_leaf_nodes": [2, 3, 5, 10],
"min_impurity_decrease": [0.001, 0.01, 0.1],
}
# Type of scoring used to compare parameter combinations (recall, since we
# care most about catching potential loan purchasers)
recall_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
min_impurity_decrease=0.001, random_state=1)
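The same GridSearchCV-with-recall pattern can be seen end to end on a small synthetic dataset. This is only an illustrative sketch; the toy data and the small parameter grid are chosen for demonstration, not taken from the project:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Imbalanced toy data, standing in for the bank dataset
X, y = make_classification(n_samples=300, weights=[0.85], random_state=1)

# Select max_depth by 5-fold cross-validated recall
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [2, 3, 4, 5]},
    scoring=metrics.make_scorer(metrics.recall_score),
    cv=5,
).fit(X, y)

print(grid.best_params_)  # winning max_depth (depends on the data)
print(grid.best_score_)   # mean cross-validated recall of the best model
```

Note that best_estimator_ is already refit on the full training data when refit=True (the default), so the explicit estimator.fit(X_train, y_train) above is a safeguard rather than a necessity.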
make_confusion_matrix(estimator, y_test)
# Accuracy on train and test
print("Accuracy on training set : ", estimator.score(X_train, y_train))
print("Accuracy on test set : ", estimator.score(X_test, y_test))
# Recall on train and test
get_recall_score(estimator)
Accuracy on training set :  0.9897142857142858
Accuracy on test set :  0.9813333333333333
Recall on training set :  0.9274924471299094
Recall on test set :  0.8791946308724832
plt.figure(figsize=(15, 10))
tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2632.00, 10.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education_1 <= 0.50
|   |   |   |   |--- weights: [11.00, 28.00] class: 1
|   |   |   |--- Education_1 > 0.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|--- Income > 116.50
|   |--- Education_1 <= 0.50
|   |   |--- weights: [0.00, 222.00] class: 1
|   |--- Education_1 > 0.50
|   |   |--- Family_3 <= 0.50
|   |   |   |--- Family_4 <= 0.50
|   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |--- Family_4 > 0.50
|   |   |   |   |--- weights: [0.00, 14.00] class: 1
|   |   |--- Family_3 > 0.50
|   |   |   |--- weights: [0.00, 33.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
# Compared with the depth-3 tree, more features (Family_4, CD_Account) now carry nonzero importance
                                    Imp
Education_1                    0.447999
Income                         0.328713
Family_3                       0.105394
Family_4                       0.050317
CCAvg                          0.042231
CD_Account                     0.025345
(all remaining features have importance 0.000000)
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 50))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Comparing with the important features of the depth-restricted model, CD_Account and Family_4 had lost all importance, but in the tuned model their importance is back. This shows that hyperparameter tuning with Grid Search is better than arbitrarily restricting a single hyperparameter.
Post-pruning might give even better results: since a grid search can only explore the hyperparameter values we specify, there is a good chance we neglect some combinations, and cost complexity post-pruning takes care of that by working directly on the fully grown tree.
The DecisionTreeClassifier provides parameters such as
min_samples_leaf and max_depth to prevent a tree from overfitting. Cost
complexity pruning provides another option to control the size of a tree. In
DecisionTreeClassifier, this pruning technique is parameterized by the
cost complexity parameter, ccp_alpha. Greater values of ccp_alpha
increase the number of nodes pruned. Here we only show the effect of
ccp_alpha on regularizing the trees and how to choose a ccp_alpha
based on validation scores.
Minimal cost complexity pruning recursively finds the node with the "weakest
link". The weakest link is characterized by an effective alpha, where the
nodes with the smallest effective alpha are pruned first. To get an idea of
what values of ccp_alpha could be appropriate, scikit-learn provides
DecisionTreeClassifier.cost_complexity_pruning_path that returns the
effective alphas and the corresponding total leaf impurities at each step of
the pruning process. As alpha increases, more of the tree is pruned, which
increases the total impurity of its leaves.
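Before applying this to our data, the effective-alpha definition can be sanity-checked on a toy one-split tree (a stump), where the pruning path should contain exactly two alphas: 0 and the root's effective alpha, computed as the impurity saved by the split divided by the number of leaves removed. This sketch uses toy data, not the bank data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One feature, perfectly separable at x = 1.5
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

stump = DecisionTreeClassifier(max_depth=1, random_state=1).fit(X, y)
path = stump.cost_complexity_pruning_path(X, y)

t = stump.tree_
n_total = t.weighted_n_node_samples[0]
# R(node): node impurity weighted by the fraction of samples reaching it
R_root = t.impurity[0] * t.weighted_n_node_samples[0] / n_total
R_leaves = sum(
    t.impurity[i] * t.weighted_n_node_samples[i] / n_total
    for i in (t.children_left[0], t.children_right[0])
)
# Effective alpha: impurity saved per leaf removed (2 leaves -> 1 node)
alpha_root = (R_root - R_leaves) / (2 - 1)
print(path.ccp_alphas)  # [0, alpha_root]
```

For this perfectly separable toy set the leaves are pure, so the root's effective alpha is just its own Gini impurity.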
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000187 | 0.000562 |
| 2 | 0.000188 | 0.001127 |
| 3 | 0.000269 | 0.002202 |
| 4 | 0.000269 | 0.002740 |
| 5 | 0.000326 | 0.004371 |
| 6 | 0.000359 | 0.005447 |
| 7 | 0.000381 | 0.005828 |
| 8 | 0.000381 | 0.006209 |
| 9 | 0.000381 | 0.006590 |
| 10 | 0.000381 | 0.006971 |
| 11 | 0.000381 | 0.007352 |
| 12 | 0.000476 | 0.007828 |
| 13 | 0.000514 | 0.009369 |
| 14 | 0.000582 | 0.009951 |
| 15 | 0.000593 | 0.011137 |
| 16 | 0.000607 | 0.011744 |
| 17 | 0.000635 | 0.012378 |
| 18 | 0.000641 | 0.014944 |
| 19 | 0.000760 | 0.017985 |
| 20 | 0.001552 | 0.019536 |
| 21 | 0.002333 | 0.021869 |
| 22 | 0.003024 | 0.024893 |
| 23 | 0.003294 | 0.028187 |
| 24 | 0.006473 | 0.034659 |
| 25 | 0.007712 | 0.042372 |
| 26 | 0.016154 | 0.058525 |
| 27 | 0.056365 | 0.171255 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
clf.fit(X_train, y_train)
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575
For the remainder, we remove the last element in
clfs and ccp_alphas, because it is the trivial tree with only one
node. Here we show that the number of nodes and tree depth decreases as alpha
increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
plt.show()
When ccp_alpha is set to zero and the other parameters
of DecisionTreeClassifier are kept at their defaults, the tree overfits, leading to
100% training accuracy and noticeably lower testing accuracy. As alpha increases, more
of the tree is pruned, thus creating a decision tree that generalizes better.
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(10, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print("Training accuracy of best model: ", best_model.score(X_train, y_train))
print("Test accuracy of best model: ", best_model.score(X_test, y_test))
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415, random_state=1)
Training accuracy of best model:  0.9911428571428571
Test accuracy of best model:  0.9813333333333333
recall_train = []
for clf in clfs:
pred_train3 = clf.predict(X_train)
values_train = metrics.recall_score(y_train, pred_train3)
recall_train.append(values_train)
recall_test = []
for clf in clfs:
pred_test3 = clf.predict(X_test)
values_test = metrics.recall_score(y_test, pred_test3)
recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415, random_state=1)
make_confusion_matrix(best_model, y_test)
# Recall on train and test
get_recall_score(best_model)
Recall on training set :  0.9425981873111783
Recall on test set :  0.8791946308724832
plt.figure(figsize=(17, 15))
tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Family_4 <= 0.50
|   |   |   |   |--- weights: [63.00, 3.00] class: 0
|   |   |   |--- Family_4 > 0.50
|   |   |   |   |--- Exp_Bins_Exp_21to30 <= 0.50
|   |   |   |   |   |--- CCAvg <= 2.45
|   |   |   |   |   |   |--- weights: [16.00, 2.00] class: 0
|   |   |   |   |   |--- CCAvg > 2.45
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Exp_Bins_Exp_21to30 > 0.50
|   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education_1 <= 0.50
|   |   |   |   |--- weights: [11.00, 28.00] class: 1
|   |   |   |--- Education_1 > 0.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|--- Income > 116.50
|   |--- Education_1 <= 0.50
|   |   |--- weights: [0.00, 222.00] class: 1
|   |--- Education_1 > 0.50
|   |   |--- Family_3 <= 0.50
|   |   |   |--- Family_4 <= 0.50
|   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |--- Family_4 > 0.50
|   |   |   |   |--- weights: [0.00, 14.00] class: 1
|   |   |--- Family_3 > 0.50
|   |   |   |--- weights: [0.00, 33.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
best_model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                                    Imp
Education_1                    0.439285
Income                         0.326289
Family_3                       0.103344
Family_4                       0.053517
CCAvg                          0.046609
CD_Account                     0.024852
Exp_Bins_Exp_21to30            0.006103
(all remaining features have importance 0.000000)
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 50))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
comparison_frame = pd.DataFrame(
{
"Model": [
"Initial decision tree model",
"Decision tree with restricted maximum depth",
"Decision tree with hyperparameter tuning",
"Decision tree with post-pruning",
],
"Train_Recall": [1, 0.77, 0.92, 0.94],
"Test_Recall": [0.85, 0.64, 0.88, 0.88],
}
)
comparison_frame
|   | Model | Train_Recall | Test_Recall |
|---|---|---|---|
| 0 | Initial decision tree model | 1.00 | 0.85 |
| 1 | Decision tree with restricted maximum depth | 0.77 | 0.64 |
| 2 | Decision tree with hyperparameter tuning | 0.92 | 0.88 |
| 3 | Decision tree with post-pruning | 0.94 | 0.88 |
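Rather than hard-coding the recall values, a comparison table like this could be built programmatically from the fitted models. A self-contained sketch on synthetic data (the data and the two model names here are illustrative, not the project's actual models):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Imbalanced toy data standing in for the bank dataset
X, y = make_classification(n_samples=500, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

models = {
    "Full tree": DecisionTreeClassifier(random_state=1),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=1),
}

rows = []
for name, m in models.items():
    m.fit(X_tr, y_tr)
    rows.append(
        {
            "Model": name,
            "Train_Recall": recall_score(y_tr, m.predict(X_tr)),
            "Test_Recall": recall_score(y_te, m.predict(X_te)),
        }
    )
comparison = pd.DataFrame(rows)
print(comparison)
```

Building the table from the models directly avoids transcription errors and makes the comparison easy to regenerate when a model is retrained.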
Decision tree with post-pruning is giving the highest recall on test set.
We can use these models to predict whether or not a customer will buy a personal loan. In both models, an undergraduate education and a family of 3 to 4 members were, along with income, the deciding factors in whether someone was likely to purchase a loan. To evaluate this model further, I would use the map I created earlier to pinpoint where these customers reside, and then look at the branches in those areas to run the campaign.
We saw that some counties had p-values low enough to carry some significance. I would start with those counties and check whether customers with an undergraduate education and families of 3 to 4 members reside there. I would also rerun the logistic regression model before submitting a formal recommendation to the CEO/CFO; I believe I could build a stronger model by reconsidering the variables I dropped.